69 research outputs found

    HMM-Based Speech Synthesis Utilizing Glottal Inverse Filtering


    Comparing human and automatic speech recognition in a perceptual restoration experiment

    Speech that has been distorted by introducing spectral or temporal gaps is still perceived as continuous and complete by human listeners, so long as the gaps are filled with additive noise of sufficient intensity. When such perceptual restoration occurs, the speech is also more intelligible than when no noise has been added in the gaps. This observation has motivated so-called 'missing data' systems for automatic speech recognition (ASR), but there have been few attempts to determine whether such systems are a good model of perceptual restoration in human listeners. Accordingly, the current paper evaluates missing data ASR in a perceptual restoration task. We evaluated two systems: one based on a new approach to bounded marginalisation in the cepstral domain, and one based on bounded conditional mean imputation. Both methods model the available speech information as a clean-speech posterior distribution that is subsequently passed to an ASR system. The proposed missing data ASR systems were evaluated using distorted speech in which spectro-temporal gaps were optionally filled with additive noise. Speech recognition performance of the proposed systems was compared against a baseline ASR system and against human speech recognition performance on the same task. We conclude that missing data methods improve speech recognition performance in a manner that is consistent with perceptual restoration in human listeners.
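
    As a concrete illustration of the second of the two methods, bounded conditional mean imputation, the sketch below imputes the unreliable bins of a single log-spectral frame under a single-Gaussian clean-speech prior. This is a minimal sketch under assumptions, not the paper's system (which produces a full clean-speech posterior for the recogniser rather than a point estimate); the prior parameters `mu` and `cov` and the reliability mask are hypothetical inputs.

```python
import numpy as np

def bounded_cond_mean_impute(y, reliable, mu, cov):
    """Impute the unreliable bins of one observed log-spectral frame.

    y        -- observed noisy log-spectrum (1-D array)
    reliable -- boolean mask, True for speech-dominated (reliable) bins
    mu, cov  -- mean and covariance of an assumed Gaussian clean-speech prior
    """
    r = np.flatnonzero(reliable)           # reliable bins
    u = np.flatnonzero(~reliable)          # unreliable (masked) bins
    x = y.astype(float).copy()
    # Conditional mean of the unreliable bins given the reliable observations.
    cov_rr = cov[np.ix_(r, r)]
    cov_ur = cov[np.ix_(u, r)]
    cond_mean = mu[u] + cov_ur @ np.linalg.solve(cov_rr, y[r] - mu[r])
    # Bounded step: with additive noise, the clean log-energy of a masked bin
    # cannot exceed the observed noisy value, so the observation is an upper bound.
    x[u] = np.minimum(cond_mean, y[u])
    return x
```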

    Comparing glottal-flow-excited statistical parametric speech synthesis methods


    Sensitivity of the human auditory cortex to acoustic degradation of speech and non-speech sounds

    The perception of speech is usually an effortless and reliable process even in highly adverse listening conditions. In addition to external sound sources, the intelligibility of speech can be reduced by degradation of the structure of the speech signal itself, for example by digital compression of sound. This kind of distortion may be even more detrimental to speech intelligibility than external distortion, given that the auditory system cannot utilize sound source-specific acoustic features, such as spatial location, to separate the distortion from the speech signal. The perceptual consequences of acoustic distortions on speech intelligibility have been studied extensively. However, the cortical mechanisms of speech perception in adverse listening conditions are not well known at present, particularly in situations where the speech signal itself is distorted. The aim of this thesis was to investigate the cortical mechanisms underlying speech perception in conditions where speech is less intelligible due to external distortion or as a result of digital compression. In the studies of this thesis, the intelligibility of speech was varied either by digital compression or by the addition of stochastic noise. Cortical activity related to the speech stimuli was measured using magnetoencephalography (MEG). The results indicated that degradation of speech sounds by digital compression enhanced the evoked responses originating from the auditory cortex, whereas the addition of stochastic noise did not modulate the cortical responses. Furthermore, it was shown that if the distortion was presented continuously in the background, the transient activity of the auditory cortex was delayed. On the perceptual level, digital compression reduced the comprehensibility of speech more than additive stochastic noise. In addition, it was demonstrated that prior knowledge of speech content substantially enhanced the intelligibility of distorted speech, and this perceptual change was associated with an increase in cortical activity within several regions adjacent to the auditory cortex. In conclusion, the results of this thesis show that the auditory cortex is very sensitive to the acoustic features of the distortion, while at later processing stages several cortical areas reflect the intelligibility of speech. These findings suggest that the auditory system rapidly adapts to the variability of the auditory environment, and can efficiently utilize previous knowledge of speech content in deciphering acoustically degraded speech signals.

    The perception of speech is usually effortless and reliable even in very poor listening conditions. However, in addition to environmental noise sources, the intelligibility of speech can also deteriorate when the structure of the speech signal itself is altered, for example by digital audio compression. Such distortion can degrade intelligibility even more severely than external interference, because the auditory system cannot exploit sound source-specific properties, such as the direction of arrival, to separate the distortion from the speech. The effects of acoustic distortions on speech perception have been studied extensively, but the brain mechanisms involved are still known rather incompletely, especially in situations where the speech signal itself is degraded. The aim of this dissertation was to investigate the brain mechanisms of speech perception in situations where the speech signal is harder to understand either because of an external sound source or because of digital compression. In the four studies of the dissertation, the intelligibility of short speech sounds and of continuous speech was manipulated either through digital compression or by adding stochastic noise to the speech signal. Brain activity related to the speech stimuli was studied with magnetoencephalography measurements. The studies showed that evoked responses generated in the auditory cortex were enhanced when speech sounds were compressed digitally, whereas stochastic noise added to the speech sounds did not affect the evoked responses. Furthermore, if a continuous distortion was presented behind the speech sounds, the activation of the auditory cortex was delayed as the intensity of the distortion increased. Listening experiments showed that digital compression reduces the intelligibility of speech sounds more strongly than stochastic noise. In addition, it was shown that prior knowledge of the speech content substantially improved the intelligibility of distorted speech, which was reflected in brain activity in regions adjacent to the auditory cortex such that intelligible speech elicited stronger activation than poorly intelligible speech. The results of the dissertation show that the auditory cortex is highly sensitive to acoustic distortions of speech sounds, and that at later processing stages several regions adjacent to the auditory cortex reflect the intelligibility of speech. Based on these results, it can be assumed that the auditory system adapts rapidly to variations in the auditory environment, among other things by exploiting prior knowledge of the speech content when interpreting a distorted speech signal.

    Atypical perceptual narrowing in prematurely born infants is associated with compromised language acquisition at 2 years of age

    Background: Early auditory experiences are a prerequisite for speech and language acquisition. In healthy children, phoneme discrimination abilities improve for native and degrade for unfamiliar, socially irrelevant phoneme contrasts between 6 and 12 months of age, as the brain tunes itself to, and specializes in, the native spoken language. This process is known as perceptual narrowing, and has been found to predict normal native language acquisition. Prematurely born infants are known to be at an elevated risk for later language problems, but it remains unclear whether these problems relate to early perceptual narrowing. To address this question, we investigated early neurophysiological phoneme discrimination abilities and later language skills in prematurely born infants and in healthy, full-term infants. Results: Our follow-up study shows for the first time that the perceptual narrowing for non-native phoneme contrasts found in the healthy controls at 12 months was not observed in very prematurely born infants. An electric mismatch response of the brain indicated that whereas full-term infants gradually lost their ability to discriminate non-native phonemes from 6 to 12 months of age, prematurely born infants retained this ability. Language performance tested at the age of 2 years showed a significant delay in the prematurely born group. Moreover, those infants who had not become specialized in native phonemes by the age of one year performed worse in the communicative language test (MacArthur Communicative Development Inventories) at the age of two years. Thus, a decline in sensitivity to non-native phonemes served as a predictor of further language development. Conclusion: Our data suggest that the detrimental effects of prematurity on language skills are based on a low degree of specialization to the native language early in development. Moreover, delayed or atypical perceptual narrowing was associated with slower language acquisition. The results hence suggest that language problems related to prematurity may partly originate already at this early tuning stage of language acquisition.

    Using group delay functions from all-pole models for speaker recognition

    This work was presented as a paper at the 14th Annual Conference of the International Speech Communication Association (Interspeech 2013), held in Lyon, France, on 25-29 August 2013. Popular features for speech processing, such as mel-frequency cepstral coefficients (MFCCs), are derived from the short-term magnitude spectrum, whereas the phase spectrum remains unused. While the common argument for using only the magnitude spectrum is that the human ear is phase-deaf, phase-based features have also remained less explored due to the additional signal processing difficulties they introduce. A useful representation of the phase is the group delay function, but its robust computation remains difficult. This paper advocates the use of group delay functions derived from parametric all-pole models instead of their direct computation from the discrete Fourier transform. Using a subset of the vocal effort data in the NIST 2010 speaker recognition evaluation (SRE) corpus, we show that group delay features derived via parametric all-pole models improve recognition accuracy, especially under high vocal effort. Additionally, the group delay features provide comparable or improved accuracy over conventional magnitude-based MFCC features. Thus, group delay functions derived from all-pole models provide an effective way to utilize information from the phase spectrum of speech signals. This work was supported by the Academy of Finland (253120).
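
    A rough sketch of the advocated parametric computation is given below: fit an all-pole (LPC) model to a single speech frame by the autocorrelation method and evaluate the group delay of the resulting all-pole filter 1/A(z) on a uniform frequency grid. This is an illustrative sketch, not the authors' implementation; the model order, analysis window and frequency grid are assumed, illustrative choices.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import group_delay

def allpole_group_delay(frame, order=20, nfft=512):
    """Group delay (in samples) of an all-pole model fitted to one speech frame."""
    x = frame * np.hamming(len(frame))                 # analysis window
    r = np.correlate(x, x, mode="full")[len(x) - 1:]   # autocorrelation r[0], r[1], ...
    a = solve_toeplitz(r[:order], r[1:order + 1])      # LPC normal equations
    a_poly = np.concatenate(([1.0], -a))               # prediction error filter A(z)
    # Group delay of the all-pole model 1/A(z); w is in radians/sample.
    w, gd = group_delay(([1.0], a_poly), w=nfft)
    return w, gd

# Example on a synthetic 25 ms frame at 16 kHz.
fs = 16000
w, gd = allpole_group_delay(np.random.randn(int(0.025 * fs)))
```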

    Early detection of continuous and partial audio events using CNN

    Sound event detection is an extension of the static auditory classification task into continuous environments, where performance depends jointly upon the detection of overlapping events and their correct classification. Several approaches have been published to date, which either develop novel classifiers or employ well-trained static classifiers with a detection front-end. This paper takes the latter approach, combining a proven CNN classifier acting on spectrogram image features with time-frequency shaped energy detection that identifies seed regions within the spectrogram that are characteristic of auditory energy events. Furthermore, the shape detector is optimised to allow early detection of events as they are developing. Since some sound events naturally have longer durations than others, waiting until completion of entire events before classification may not be practical in a deployed system. The early detection capability of the system is thus evaluated for the classification of partial events. Performance for continuous event detection is shown to be good, with accuracy well maintained when detecting partial events.
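
    As a schematic stand-in for the time-frequency shaped energy detection described above (not the paper's detector), the sketch below smooths a spectrogram, estimates a per-band noise floor, and labels connected regions that rise well above it as candidate seed regions; the thresholds and smoothing kernel are illustrative assumptions. Each labelled region could then be cropped, possibly before the underlying event has finished, and passed to a separately trained CNN classifier.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, label
from scipy.signal import spectrogram

def seed_regions(audio, fs, thresh_db=12.0):
    """Label time-frequency regions whose smoothed energy exceeds the noise floor."""
    f, t, sxx = spectrogram(audio, fs=fs, nperseg=512, noverlap=384)
    log_s = 10.0 * np.log10(sxx + 1e-10)                  # log-power spectrogram
    smoothed = gaussian_filter(log_s, sigma=(1.0, 2.0))   # shape energy in time-frequency
    floor = np.median(smoothed, axis=1, keepdims=True)    # per-band noise floor estimate
    mask = smoothed > floor + thresh_db                   # bins well above the floor
    labels, n_regions = label(mask)                       # connected candidate seed regions
    return labels, n_regions, f, t
```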

    PREMIUM, a benchmark on the quantification of the uncertainty of the physical models in the system thermal-hydraulic codes: methodologies and data review

    The objective of the Post-BEMUSE Reflood Model Input Uncertainty Methods (PREMIUM) benchmark is to make progress on the quantification of the uncertainty of the physical models in system thermal-hydraulic codes by considering a concrete case: the physical models involved in the prediction of core reflooding. The present document was initially conceived as a final report for Phase I, “Introduction and Methodology Review”, of the PREMIUM benchmark. The objective of Phase I is to refine the definition of the benchmark and publish the available methodologies of model input uncertainty quantification relevant to the objectives of the benchmark. In its initial version the document was approved by WGAMA and proved useful during the subsequent phases of the project. Once Phase IV was completed, and following the suggestion of WGAMA members, the document was updated with a few new sections, in particular the description of four new methodologies that were developed during this activity. These developments were carried out by some participants while contributing to PREMIUM progress (which is why this report arrives after those of the other phases). After this revision the document title was changed to “PREMIUM methodologies and data review”. The introduction first includes a chapter devoted to contextualizing the benchmark within nuclear safety research and licensing, followed by a description of the PREMIUM objectives. Next, the phases into which the benchmark is divided and its organization are described. Chapter two reviews the involvement of the different participants, with a brief explanation of the input uncertainty quantification methodologies used in the activity. The document ends with some conclusions on the development of Phase I, some more general remarks and some statements on the benefits of the benchmark, which can be briefly summarized as follows:
    - Contribution to the development of tools and experience related to uncertainty calculation, and promotion of the use of BEPU approaches for licensing and safety assessment purposes;
    - Contribution to the prioritization of improvements to thermal-hydraulic system codes;
    - Contribution to a fluent and close interaction between the scientific community and regulatory organizations.
    The appendices include the complete description of the FEBA/SEFLEX experimental data used in the benchmark, the CIRCÉ and FFTBM methodologies, and the general requirements and description specification used for Phase I. Following the revision of the document, four additional appendices have been added describing the methods developed during the activity: the MCDA, DIPE, Tractebel IUQ and PSI methods.

    Non-hexagonal neural dynamics in vowel space

    Are the grid cells discovered in rodents relevant to human cognition? Following up on two seminal studies by others, we aimed to check whether an approximate 6-fold, grid-like symmetry shows up in the cortical activity of humans who "navigate" between vowels, given that vowel space can be approximated by a continuous trapezoidal 2D manifold spanned by the first and second formant frequencies. We created 30 vowel trajectories in the assumedly flat central portion of the trapezoid. Each of these trajectories had a duration of 240 milliseconds, with a steady start and end point on the perimeter of a "wheel". We hypothesized that if the neural representation of this "box" is similar to that of rodent grid units, there should be an at least partial hexagonal (6-fold) symmetry in the EEG response of participants who navigate it. We did not find any dominant n-fold symmetry, however; instead, using PCA, we find indications that the vowel representation may reflect phonetic features, as positioned on the vowel manifold. The suggestion, therefore, is that vowels are encoded in relation to their salient sensory-perceptual variables, and are not assigned to arbitrary grid-like abstract maps. Finally, we explored the relationship between the first PCA eigenvector and putative vowel attractors for native Italian speakers, who served as the subjects in our study.
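
    The symmetry test referred to above is commonly run as a directional regression; the sketch below is a schematic version under assumptions, not the analysis pipeline of this study. Per-trajectory direction angles `theta` (in F1/F2 space, radians) and response amplitudes `resp` are hypothetical inputs, and a real analysis would add cross-validated grid-orientation estimation and proper statistics.

```python
import numpy as np

def nfold_symmetry_scores(theta, resp, folds=(4, 5, 6, 7, 8)):
    """R^2 of an n-fold directional modulation model for each candidate symmetry."""
    scores = {}
    for n in folds:
        # Regress the response on cos/sin of n times the trajectory direction.
        X = np.column_stack([np.ones_like(theta),
                             np.cos(n * theta), np.sin(n * theta)])
        beta, *_ = np.linalg.lstsq(X, resp, rcond=None)
        resid = resp - X @ beta
        scores[n] = 1.0 - resid.var() / resp.var()
    return scores  # a grid-like (hexadirectional) code would favour n = 6
```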